import pandas as pd
# Must put client ID / client Secret in different text file to not compromise information
import sys
sys.path.append('/home/jovyan')
import spotify_key
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# Call the separate text file info for client ID/ client secret
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=spotify_key.SPOTIPY_CLIENT_ID,
                                                           client_secret=spotify_key.SPOTIPY_CLIENT_SECRET))
def call_playlist(username, playlist_id):
    #General Pandas df
    playlist_audio_features_list = ["artist","album","track_name","track_id","danceability","energy","loudness", "speechiness","acousticness","instrumentalness","liveness","valence","tempo", "duration_ms"]
    playlist_df = pd.DataFrame(columns = playlist_audio_features_list)
    #Getting Information
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    results = tracks
    for i in range(len(results)):
        # Create empty dict
        playlist_audio_features = {}
        # Get artist, album, track name, track id
        playlist_audio_features["artist"] = results[i]["track"]["album"]["artists"][0]["name"]
        playlist_audio_features["album"] = results[i]["track"]["album"]["name"]
        playlist_audio_features["track_name"] = results[i]["track"]["name"]
        playlist_audio_features["track_id"] = results[i]["track"]["id"]
        # Get audio features utilizing API
        audio_features = sp.audio_features(playlist_audio_features["track_id"])[0]
        for feature in playlist_audio_features_list[4:]:
            playlist_audio_features[feature] = audio_features[feature]
        # Concat the dfs
        track_df = pd.DataFrame(playlist_audio_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
    # Function commands done
    return playlist_df
playlist1 = call_playlist("patrick khoury","5bKOCJ2v5M9KQtv8Po7u62")

DH140_Final_Project

Introduction
In the digital Internet era, the advent of streaming services has revolutionized the way that the world interacts with and experiences music. Beginning in the late 1990s, peer-to-peer music sharing websites like Napster became popular among college-age individuals in the United States, who valued the ability to download music for free and gain access to alternate versions of their favorite songs that were not available via traditional, physical media formats (Brewster, 1). Although Napster’s existence was short-lived due to copyright issues, its success was noted by the industry as proof of demand for online music sharing. Shortly thereafter, iTunes and Pandora were created to fill this online gap in a way that was more viable for the music industry. Now, the rise of streaming giants such as Apple Music and Spotify has not only altered the dynamics of music consumption itself, but has also shifted industry compensation models to heavily factor in payment per stream. Artists have taken note, with some music legends even going so far as to promote their own streaming services, such as Jay-Z’s Tidal. As a result, this dramatic transformation has introduced a new, dominating variable in an artist’s commercial success: song “streamability.”
I conducted an analysis of song popularity in the streaming age of music, with popularity defined as the number of streams. In doing so, I aim to answer my principal research question: what key factors might differentiate songs as more “streamable” than others? I accomplish this by examining the top 1,525 most streamed songs on Spotify through three different lenses: (1) key song attributes such as danceability and tempo, pulled via the Spotify API; (2) genre; and (3) time era.
In order to examine what might differentiate songs as more “streamable” than others, I further broke my research down into the following sub-question sets:
- What is the relationship between key attributes and song genre? What is the relationship between key attributes and song era?
- What is the relationship between key attributes and song popularity?
- What is the relationship between song genre and song popularity? What is the relationship between song era and song popularity?
Furthermore, upon reviewing the plotted results of the final question set, I felt it would be useful to add the following follow-up questions: To what extent does a stream cap exist based on genre? Based on era?
Data and Analytical Process
Data
The song attributes that Spotify identifies for each song pulled from their API are outlined below, as defined by Spotify at this link and this link:
- artist: The artist(s) who performed the track
- album: The album on which the track appears
- track_name: The name of the track
- track_id: The Spotify ID for the track. The base-62 identifier found at the end of the Spotify resource identifier.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- duration_ms: The track length in milliseconds
The manually-added metrics that I incorporated are outlined as follows:
- Year: The year that the song was released.
- Streams: The cumulative number of streams on Spotify for that song, current as of July 24, 2023 at 11:35 AM.
- Genre: The genre of music for that song.
- Era: The era in which the song was released, grouped into early (X0-X4) and late (X5-X9) half-decades. (e.g. late 2010s, early 2000s)
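The half-decade grouping described for Era can be sketched as a small helper. This is only an illustration of the rule (X0-X4 is "Early", X5-X9 is "Late"); the function name `year_to_era` is hypothetical and is not part of the notebook, which builds the same labels with an explicit list of conditions.

```python
def year_to_era(year: int) -> str:
    """Map a release year to a half-decade label such as 'Late 2010s'."""
    decade = (year // 10) * 10                     # e.g. 2017 -> 2010
    half = "Early" if year % 10 <= 4 else "Late"   # X0-X4 early, X5-X9 late
    return f"{half} {decade}s"


# A few sample years to show the labels the rule produces
for sample_year in (1967, 2003, 2017, 2021):
    print(sample_year, "->", year_to_era(sample_year))
```

Applied to a column, this would look like `df['Year'].apply(year_to_era)`, assuming the column holds integer years.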
Link to Sources of data:
Analytical Process
In setting out to answer my principal question, “what key factors might differentiate songs as more ‘streamable?’”, I aimed to examine the number of streams through three lenses: (1) key song attributes pulled via the Spotify API; (2) genre; and (3) time era. Although Spotify’s API encompasses a wide range of pertinent song attributes, I felt that genre and time era were likely to have an impact on the number of streams, so I opted to add these factors to those that Spotify provides. I broke down my principal question into three distinct steps so that I could better isolate and examine the relationships between each of the factors and their impact on streams.
My first step was to gain a better understanding of the values identified by Spotify’s API by contextualizing them – although their API does well to explain the metrics in words, the metrics can be hard to understand for someone without a working knowledge of music theory. I aimed to take these attributes out of the abstract by applying them to metrics that are more familiar, examining their relationship with song genre and time era. In this way, I was also able to gain an idea of how all three of my lenses integrated with and played off of one another, better setting me up to understand and further hypothesize about my principal research question. For this step, I examined the following sub-questions via box plot visualizations: What is the relationship between key attributes and song genre? What is the relationship between key attributes and song era?
In moving on to the second step, I decided to outline my expectations for findings stemming from my second sub-question: What is the relationship between key attributes and song popularity? For this question, I plotted each key audio feature for each song in the top 1,525 dataset against the total number of streams that song has, allowing me to determine positive or negative correlations between the two metrics through a trend line analysis. A scatter plot is best in this situation, as it allows me to represent all of the data on one graph and provides both a visual and an analytical means to test the strength of the relationship between the two variables.
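As a rough illustration of the trend line analysis described above, the sketch below fits a straight line to the base-10 logarithm of stream counts. The data is made up for illustration, the helper name `log_trend` is hypothetical, and fitting on log-scaled streams is an assumption: whether a plotted trend line is fit on raw or log-scaled values depends on the plotting options used.

```python
import numpy as np


def log_trend(feature_values, stream_counts):
    """Fit log10(streams) = slope * feature + intercept by least squares."""
    x = np.asarray(feature_values, dtype=float)
    log_y = np.log10(np.asarray(stream_counts, dtype=float))
    slope, intercept = np.polyfit(x, log_y, deg=1)
    return slope, intercept


# Made-up loudness (dB) and stream counts, for illustration only
toy_loudness = [-12.0, -9.0, -7.0, -5.0, -4.0]
toy_streams = [7.5e8, 9.0e8, 8.2e8, 1.1e9, 9.6e8]

slope, intercept = log_trend(toy_loudness, toy_streams)
# On a log scale, one extra unit of the feature multiplies streams by 10**slope
print(f"slope={slope:.4f}, multiplicative change per unit={10 ** slope:.3f}")
```

A slope near zero (a multiplicative change near 1.0) corresponds to the flat trend lines that indicate little relationship between the feature and streams.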
Moving on to my third step, I again outlined my expectations for findings stemming from the third set of sub-questions: What is the relationship between song genre and song popularity? What is the relationship between song era and song popularity? For these questions, I created two plots for each song genre and song era, plotting against song popularity – I created a box plot to demonstrate the median, upper quartile, lower quartile, minimum, and maximum number of streams; and created a bar plot to demonstrate the mean number of streams, which is not represented in a box plot. I opted to create this second bar plot of the mean so that I can get a clear visual representation of stream numbers across genre and eras that also illustrates the impact of outlier songs. These outliers would either be songs that are low in the stream ranking, or those that are much higher than the rest – viral songs, which are important to answer the principal research question.
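The difference between the median shown in a box plot and the mean shown in a bar plot can be sketched with a grouped aggregation. The data frame below is made up for illustration (the genre labels and stream counts are not from the dataset); the point is only that a single viral outlier pulls the mean well above the median.

```python
import pandas as pd

# Made-up data: "Genre A" contains one viral outlier, "Genre B" does not
toy_df = pd.DataFrame({
    "Genre": ["Genre A", "Genre A", "Genre A", "Genre B", "Genre B", "Genre B"],
    "Streams": [800e6, 850e6, 3000e6, 900e6, 950e6, 1000e6],
})

# Mean and median per genre, computed directly from the data frame
summary = toy_df.groupby("Genre")["Streams"].agg(["mean", "median"])
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

The same pattern, applied to the project's data frame with its `Genre` or `Era` column, would produce the per-group averages without entering them by hand.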
Finally, after viewing the plotted results, I added the following follow-up questions: To what extent does a stream cap exist based on genre? Based on era? I did this because, although the box plot does include the maximum value, its formatting makes the maximums difficult to compare across genres or eras. By creating an additional bar plot, I was able to isolate the maximum value and better analyze whether certain song genres or eras have a marked “cap” compared to the others, even if the stream numbers were not that different across genres or eras on average.
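Isolating the maximum value per group, as described above, can be sketched as follows. The data is made up for illustration, and the column names simply mirror the ones used in this project; `idxmax` is used so that the track behind each cap is kept alongside the value.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "track_name": ["Song 1", "Song 2", "Song 3", "Song 4", "Song 5"],
    "Genre": ["Genre A", "Genre A", "Genre B", "Genre B", "Genre B"],
    "Streams": [2.1e9, 3.4e9, 0.9e9, 1.8e9, 1.2e9],
})

# Row label of the most streamed song within each genre
top_idx = toy_df.groupby("Genre")["Streams"].idxmax()

# Keep the genre, the track behind the cap, and the cap itself
caps = toy_df.loc[top_idx, ["Genre", "track_name", "Streams"]].set_index("Genre")
print(caps.sort_values("Streams", ascending=False))
```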
Importing the Data
Merging Data
# Data Frame 2 with # Of Streams, Year, Genre
import pandas as pd
import numpy as np
Streams_genres_year_df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vRTErHeCPMXDadh2VXqkIO-e0wDRa0WuQbp5IUU31ZTMLuAVM6erHhG8Z5LUOtZRclmN5oq_q7qgzGQ/pub?output=csv')
conditions = [
(Streams_genres_year_df['Year'] >= 2020),
(Streams_genres_year_df['Year'] < 2020) & (Streams_genres_year_df['Year'] >= 2015),
(Streams_genres_year_df['Year'] < 2015) & (Streams_genres_year_df['Year'] >= 2010),
(Streams_genres_year_df['Year'] < 2010) & (Streams_genres_year_df['Year'] >= 2005),
(Streams_genres_year_df['Year'] < 2005) & (Streams_genres_year_df['Year'] >= 2000),
(Streams_genres_year_df['Year'] < 2000) & (Streams_genres_year_df['Year'] >= 1995),
(Streams_genres_year_df['Year'] < 1995) & (Streams_genres_year_df['Year'] >= 1990),
(Streams_genres_year_df['Year'] < 1990) & (Streams_genres_year_df['Year'] >= 1985),
(Streams_genres_year_df['Year'] < 1985) & (Streams_genres_year_df['Year'] >= 1980),
(Streams_genres_year_df['Year'] < 1980) & (Streams_genres_year_df['Year'] >= 1975),
(Streams_genres_year_df['Year'] < 1975) & (Streams_genres_year_df['Year'] >= 1970),
(Streams_genres_year_df['Year'] < 1970) & (Streams_genres_year_df['Year'] >= 1965),
(Streams_genres_year_df['Year'] < 1965) & (Streams_genres_year_df['Year'] >= 1960),
(Streams_genres_year_df['Year'] < 1960) & (Streams_genres_year_df['Year'] >= 1955)
]
values = ['Early 2020s', 'Late 2010s', 'Early 2010s', 'Late 2000s', 'Early 2000s', 'Late 1990s', 'Early 1990s', 'Late 1980s', 'Early 1980s', 'Late 1970s', 'Early 1970s', 'Late 1960s', 'Early 1960s', 'Late 1950s']
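# A string default labels any year before 1955 and keeps the result dtype consistent with the string labels
Streams_genres_year_df['Era'] = np.select(conditions, values, default='Unknown')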
# Streams_genres_year_df -- used as check, calling data frame
# Merging Data Frame 1 with Data Frame 2
combine_df = [playlist1, Streams_genres_year_df]
final_df = pd.concat(combine_df, axis=1)
Cleaning Data
# Checking for null values:
final_df.isnull().sum()
# Checking for duplicates:
final_df.duplicated().value_counts()
# Dropping the duplicates
final_df.drop_duplicates(inplace=True)
# Dropping data that doesn't have enough representation in data set
# Drop All songs from Late 1950s (4 Values)
final_df = final_df.drop(final_df[final_df['Era']== 'Late 1950s'].index)
# Drop All songs from Early 1960s (5 Values)
final_df = final_df.drop(final_df[final_df['Era']== 'Early 1960s'].index)
# Drop White-noise from dataframe (1 Value)
final_df = final_df.drop(final_df[final_df['Genre']== 'Broadband Noise'].index)
# Drop Jazz from dataframe (1 Value)
final_df = final_df.drop(final_df[final_df['Genre']== 'Jazz'].index)
# Drop Children's Music from dataframe (1 Value)
final_df = final_df.drop(final_df[final_df['track_name']== 'Baby Shark'].index)
Relevant Plots and Discussion
Question 1: What is the relationship between key attributes and song genre? What is the relationship between key attributes and song era?
My first sub-question was concerned with the relationship between key attributes and song genre, as well as key attributes and song era. To answer this question, I broke up each song in the data set into its ten defining audio features, and plotted each value against the song's genre and the era in which it was released, respectively.
The following plots are those from my general notebook that displayed at least a mild indication of correlation:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# Danceability -- low grunge value
px.box(final_df,
    x="Genre",
    y="danceability",
    hover_name='track_name',
    color_discrete_sequence=['green'],
    labels = dict(danceability = 'Danceability'),
    title='Genre vs Danceability')
Figure 1: Genre vs Danceability
# Energy -- low folk value
px.box(final_df,
    x="Genre",
    y="energy",
    hover_name='track_name',
    color_discrete_sequence=['green'],
    labels = dict(energy = 'Energy'),
    title='Genre vs Energy')
Figure 2: Genre vs Energy
# Speechiness -- high hip-hop/rap value
px.box(final_df,
    x="Genre",
    y="speechiness",
    hover_name='track_name',
    color_discrete_sequence=['green'],
    labels = dict(speechiness = 'Speechiness'),
    title='Genre vs Speechiness')
Figure 3: Genre vs Speechiness
# Acousticness -- high folk value
px.box(final_df,
    x="Genre",
    y="acousticness",
    hover_name='track_name',
    color_discrete_sequence=['green'],
    labels = dict(acousticness = 'Acousticness'),
    title='Genre vs Acousticness')
Figure 4: Genre vs Acousticness
In examining the relationship between song attributes and genre, my analysis shows fewer areas of significant difference than I initially expected. Ultimately, the values for many attributes remained within a similar range across genres, including song duration, liveness, valence, tempo, loudness, and instrumentalness. Folk, Grunge, and Hip-Hop/Rap were the only genres to demonstrate a significant deviation in song attributes from the other genres: Figures 2 and 4 demonstrated that Folk correlated with significantly lower energy and significantly higher acousticness values, Figure 1 demonstrated that Grunge correlated with significantly lower danceability values, and Figure 3 showed that Hip-Hop/Rap correlated with significantly higher speechiness values. These findings demonstrate that, save for a few exceptions, the majority of key attributes’ values are consistent across genres. For example, on average, one cannot assume or predict a significant difference in valence range between Rock and Pop songs on the Spotify top 1,525 most streamed songs list.
Furthermore, in examining the relationship between song era and attribute, my analysis demonstrated no significant difference across attributes for each era. This was also inconsistent with my expectations, as I had assumed that there would be key differences between average song tempo values across different eras, for example. However, my findings show that the danceability, energy, valence, tempo, speechiness, loudness, instrumentalness, liveness, duration, and acousticness values all remained within a similar range of one another across eras – one cannot predict or assume any significant correlation between song era and attributes.
Question 2: What is the relationship between key attributes and song popularity?
My second sub-question was concerned with the relationship between key attributes and song popularity. To answer this question, I broke up each song in the data set into its ten defining audio features, and plotted each value against the number of Spotify streams the song has.
The following plots are those from my general notebook that displayed at least a mild indication of correlation:
# Instrumentalness
fig = px.scatter(final_df,
    x="instrumentalness",
    y="Streams", trendline = 'ols', trendline_color_override="red",
    hover_name='track_name',
    log_y=True,
    labels = dict(instrumentalness = 'Instrumentalness', Streams = 'Streams (Billions)'),
    title='Instrumentalness vs Spotify Streams')
fig.show()
Figure 5: Instrumentalness vs Streams
# Loudness
fig = px.scatter(final_df,
    x="loudness",
    y="Streams", trendline = 'ols', trendline_color_override="red",
    hover_name='track_name',
    log_y=True,
    labels = dict(loudness = 'Loudness (dB)', Streams = 'Streams (Billions)'),
    title='Loudness vs Spotify Streams')
fig.show()
Figure 6: Loudness vs Streams
In examining the relationship between key attributes and song popularity, my analysis does not show any significant correlation. Figure 5 demonstrates that stream counts are slightly negatively associated with instrumentalness values, and Figure 6 shows that stream counts are slightly positively associated with loudness values. However, neither is strong enough to be considered predictive. Danceability, energy, valence, tempo, speechiness, liveness, duration, and acousticness all remained within similar ranges for each song, despite differences in popularity. This finding was the most inconsistent with my expectations and came as a great surprise to me, indicating that, overall, there is no meaningful correlation between key attributes and song popularity.
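One way to put numbers on the visual impressions above is to compute a Pearson correlation between each audio feature and the log of the stream count. The sketch below is an illustration under stated assumptions, not the method used for the figures: the helper name `feature_correlations` is hypothetical, the toy data is made up, and the log transform is a choice that mirrors the log-scaled y-axis.

```python
import numpy as np
import pandas as pd


def feature_correlations(df, feature_cols, target_col="Streams"):
    """Pearson r between each feature column and log10 of the target column."""
    log_target = np.log10(df[target_col].astype(float))
    return df[feature_cols].astype(float).corrwith(log_target).sort_values()


# Made-up values for illustration only
toy_df = pd.DataFrame({
    "loudness": [-11.0, -8.0, -6.5, -5.0, -4.0, -7.0],
    "instrumentalness": [0.40, 0.05, 0.00, 0.10, 0.00, 0.02],
    "Streams": [7.2e8, 8.8e8, 9.9e8, 9.1e8, 1.3e9, 8.0e8],
})

print(feature_correlations(toy_df, ["loudness", "instrumentalness"]))
```

Values of r close to zero would be consistent with the conclusion that no single attribute is predictive of stream counts.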
Question 3: What is the relationship between song genre and song popularity? What is the relationship between song era and song popularity?
For my third sub-question, I was concerned with the relationship between song genre and song popularity, as well as the relationship between song era and song popularity. To answer this, I graph the average number of streams for each song genre and for each release era. I then follow up with box plots, which allow me to examine key values beyond just the average.
# Bar Graph of average values
genres_list=['Pop', 'Hip-Hop/Rap', 'R&B/Soul', 'Rock', 'EDM', 'Reggaeton', 'Indie', 'Reggae', 'Country', 'KPop', 'Grunge', 'Trap', 'Holiday', 'Disco', 'Folk']
# Manually inputted
average_streams_genre_list=[963281427.141876, 927854991.882562, 923998844.351351, 884936775.769231, 953315868.859259, 784786176.741259, 1002932660.5, 881224482.769231, 751482093.058824, 843707848.333333, 871077111.75, 1060559587.0, 948618947.833333, 928109928.2, 945638977.666667]
# Graph
px.bar(x=genres_list,
    y=average_streams_genre_list,
    hover_name=average_streams_genre_list,
    color_discrete_sequence=['orange'],
    labels = dict(x = 'Genre', y= 'Streams (Billions)'),
    title= 'Average Streams of Each Genre')
Figure 7: Average Streams of Each Genre
In examining the relationship between song genre and song popularity, as measured by average number of streams, my analysis did not find a significant correlation between these two variables for the majority of genres. However, this figure demonstrates that Trap and Indie have the highest values with a difference of approximately 77 million streams higher than the average performing song, and Country and Reggaeton have the lowest values with a difference of approximately 175 million streams fewer than the average performing song. With average stream counts ranging from 750 million to 1.06 billion, these findings demonstrate that there is no one genre that is significantly correlated with the best performance, but Country and Reggaeton are significantly correlated with lower average stream counts.
# Box Plot
px.box(final_df,
    x="Genre",
    y="Streams",
    hover_name='track_name',
    color_discrete_sequence=['orange'],
    title='Genre vs Streams')
Figure 8: Genres vs Streams
The box plot dives deeper into the data, as it shows the minimum, lower quartile, median, upper quartile, and maximum values. This gives further context into the range of stream counts each genre receives. For instance, KPop has the lowest median value, despite its average stream count being larger than those of both Reggaeton and Country.
# Bar Graph of Average Values
eras_list=['Late 1960s', 'Early 1970s', 'Late 1970s', 'Early 1980s', 'Late 1980s', 'Early 1990s', 'Late 1990s', 'Early 2000s', 'Late 2000s', 'Early 2010s', 'Late 2010s', 'Early 2020s']
# Manually inputted from data gathered in above cell
Average_Streams_Era_List=[777114674.8125, 720830732.894737, 868610360.096774, 882998730.941176, 854068747.24, 902084667.555556, 818445800.638889, 894726941.823529, 778915441.903846, 923113497.338645, 973673553.34891, 920777993.722222]
# Graph
px.bar(x=eras_list,
    y=Average_Streams_Era_List,
    hover_name=Average_Streams_Era_List,
    color_discrete_sequence=['orange'],
    labels = dict(x = 'Era', y= 'Streams (Billions)'),
    title= 'Average Streams of Each Era')
Figure 9: Average Streams of Each Era
In examining song era and song popularity, as measured by average number of streams, my analysis did not find a significant correlation between these two variables for any era. However, this figure shows that the late 2010s has the highest value with a difference of approximately 50 million streams higher than the average performing song, and the early 1970s has the lowest value with a difference of approximately 200 million streams fewer than the average performing song. With average stream counts ranging from 720 million to 975 million with very little difference between, these findings demonstrate that there is no one era that is significantly correlated with the best performance or with the worst performance.
# Box Plot
px.box(final_df,
    x="Era",
    y="Streams",
    color_discrete_sequence=['orange'],
    hover_name='track_name',
    title='Era vs Streams')
Figure 10: Era vs Streams
The box plot dives deeper into the data, showing the minimum, lower quartile, median, upper quartile, and maximum. This gives further context into the range of stream counts each era tends to receive. For instance, the early 1990s has the highest median and upper quartile values, despite its average stream count being lower than that of three other eras.
The results in this section were surprising to me, as I expected genres to have a bigger impact on song popularity than my analysis showed, so I further explored the impact that song genres have on song popularity, this time as measured by the “cap” or highest number of streams per song.
Follow up: To what extent does a stream cap exist based on genre? Based on era?
# Bar Graph of Cap (maximum) Values
genres_list=['Pop', 'Hip-Hop/Rap', 'R&B/Soul', 'Rock', 'EDM', 'Reggaeton', 'Indie', 'Reggae', 'Country', 'KPop', 'Grunge', 'Trap', 'Holiday', 'Disco', 'Folk']
# Manually inputted from data gathered in above cell
max_streams_genre_list=[3562543890.0, 2808096550.0, 3703895074.0, 2594040133.0, 2591224264.0, 1763363713.0, 2557975762.0, 1514385306.0, 1271967645.0, 1695000896.0, 1691731327.0, 2727403487.0, 1454664422.0, 1583955910.0, 2009094673.0]
# Graph
px.bar(x=genres_list,
    y=max_streams_genre_list,
    hover_name=max_streams_genre_list,
    color_discrete_sequence=['lightskyblue'],
    labels = dict(x = 'Genre', y= 'Streams (Billions)'),
    title= 'Max Streams Vs Genre')
Figure 11: Max Streams of Each Genre
I further explored the impact that song genres have on song popularity, this time as measured by the “cap” or highest number of streams per song. I found that my data aligned in four tiers: Pop and R&B/Soul had the highest maximum streams, with both surpassing 3.5 billion; Hip-Hop/Rap, Rock, EDM, Indie, and Trap make up the second tier, with each surpassing 2.5 billion streams; Folk, Disco, Grunge, KPop, Reggae, and Reggaeton make up the third tier, with each surpassing 1.5 billion streams; and finally, Country and Holiday fall below 1.5 billion streams. Together, my plots of the average and maximum stream numbers per genre allow me to conclude that where a genre “maxes out” with its most streamed song is not necessarily predictive of how the genre performs overall. In other words, just because the most streamed song is R&B/Soul, one cannot assume that R&B/Soul songs performed better on average. In this way, my analysis demonstrates that, perhaps apart from Country and Reggaeton performing at a lower average, there is no significant relationship between song genre and popularity.
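The four tiers described above can be reproduced by binning the maximum values. The sketch below reuses the genre names and maximum stream counts entered for Figure 11 and the thresholds stated in the paragraph (1.5, 2.5, and 3.5 billion); the names `tier_edges` and `tier_labels` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Values copied from the lists used for Figure 11
genres_list = ['Pop', 'Hip-Hop/Rap', 'R&B/Soul', 'Rock', 'EDM', 'Reggaeton', 'Indie',
               'Reggae', 'Country', 'KPop', 'Grunge', 'Trap', 'Holiday', 'Disco', 'Folk']
max_streams_genre_list = [3562543890.0, 2808096550.0, 3703895074.0, 2594040133.0,
                          2591224264.0, 1763363713.0, 2557975762.0, 1514385306.0,
                          1271967645.0, 1695000896.0, 1691731327.0, 2727403487.0,
                          1454664422.0, 1583955910.0, 2009094673.0]

caps = pd.Series(max_streams_genre_list, index=genres_list, name="max_streams")

# Tier boundaries taken from the discussion: 1.5B, 2.5B, and 3.5B streams
tier_edges = [0, 1.5e9, 2.5e9, 3.5e9, float("inf")]
tier_labels = ["below 1.5B", "1.5B to 2.5B", "2.5B to 3.5B", "above 3.5B"]
tiers = pd.cut(caps, bins=tier_edges, labels=tier_labels)

# List the genres in each tier, highest tier first
for label in reversed(tier_labels):
    members = tiers[tiers == label].index.tolist()
    print(f"{label}: {members}")
```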
# Bar Graph of Cap Values
eras_list=['Late 1960s', 'Early 1970s', 'Late 1970s', 'Early 1980s', 'Late 1980s', 'Early 1990s', 'Late 1990s', 'Early 2000s', 'Late 2000s', 'Early 2010s', 'Late 2010s', 'Early 2020s']
# Manually inputted from data gathered in above cell
Max_Streams_Era_List=[1056974289.0, 1145900189.0, 2197010679.0, 1607156534.0, 1554702671.0, 1691731327.0, 1704625977.0, 1829992958.0, 1635864522.0, 2282771485.0, 3562543890.0, 3703895074.0]
# Graph
px.bar(x=eras_list,
    y=Max_Streams_Era_List,
    hover_name=Max_Streams_Era_List,
    color_discrete_sequence=['lightskyblue'],
    labels = dict(x = 'Era', y= 'Streams (Billions)'),
    title= 'Max Streams Vs Era')
Figure 12: Max Streams of Each Era
I further explored the impact that song era has on song popularity, this time as measured by the “cap” or highest number of streams per song. I found that my data did show a correlation between these two variables, with maximum streams generally rising for more recent eras. Only the late 1970s and early 2000s break with this trend. Notably, there is a wide disparity between the two most recent eras and all others: the late 2010s and early 2020s reach approximately 3.6 and 3.7 billion streams, respectively. Those maximums are roughly double or more the maximum of every other era, except for the late 1970s and the early 2010s.
Final Thoughts / Bigger Picture
In response to my principal research question, “What key factors might differentiate songs as more ‘streamable’ than others?”, my findings demonstrate a much more limited relationship than anticipated. Ultimately, factors such as Spotify’s song attribute metrics, song genre, and song era have limited to no correlation with the “streamability” of songs. The only relationships of note involve genre, with Country and Reggaeton showing lower average stream counts. However, this does not necessarily mark songs of those genres as “unstreamable”: from the exploratory data analysis, we can see that the most represented artist and most represented album on Spotify’s top 1,525 most streamed songs list are a Reggaeton artist and a Reggaeton album. Ultimately, these findings suggest that none of the factors investigated is strong enough to serve as a predictor of song virality. Any artists, producers, or influencers/content creators looking to create the next big hit will not be able to isolate and purposefully include factors that lend themselves to “streamability”; based on my findings, it is not possible to reverse-engineer a most “streamable” song based on key song attributes, genre, or time era.
From a big-picture perspective, although the premise of being able to design a most “streamable” song based on key factors is interesting, the lack of correlation in my findings has a large silver lining: if producers were able to meaningfully reverse-engineer songs and only incorporate aspects that increase song popularity, many songs would sound the same. In a field where streaming numbers have become increasingly important, it would be a great shame to have large swaths of the industry consistently tapping into the same factors to get to the top. Growing artists would fear the lost revenue that could come from straying from the path towards “streamability,” and existing music legends might be pressured to release same-sounding music to stay at the top, even more than already happens today. In the end, the limited relationship demonstrated between key factors and song popularity might just be what allows us to enjoy creative and diverse musical talents instead of a mainly homogeneous experience.